Towards Language-Universal End-to-End Speech Recognition
Building speech recognizers in multiple languages typically involves
replicating a monolingual training recipe for each language, or utilizing a
multi-task learning approach where models for different languages have separate
output labels but share some internal parameters. In this work, we exploit
recent progress in end-to-end speech recognition to create a single
multilingual speech recognition system capable of recognizing any of the
languages seen in training. To do so, we propose the use of a universal
character set that is shared among all languages. We also create a
language-specific gating mechanism within the network that can modulate the
network's internal representations in a language-specific way. We evaluate our
proposed approach on the Microsoft Cortana task across three languages and show
that our system outperforms both the individual monolingual systems and systems
built with a multi-task learning approach. We also show that this model can be
used to initialize a monolingual speech recognizer, and can be used to create a
bilingual model for use in code-switching scenarios. Comment: submitted to ICASSP 201
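
The two ideas named in this abstract, a character set shared by all languages and a language-dependent gate on the network's internal representation, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the language codes, the sample transcripts, the function names, and the simplified gate (a per-language sigmoid applied element-wise) are hypothetical and are not taken from the paper.

```python
import math


def build_universal_charset(transcripts_by_language):
    """Collect every character seen in any language into one sorted label set."""
    charset = set()
    for transcripts in transcripts_by_language.values():
        for text in transcripts:
            charset.update(text)
    return sorted(charset)


def language_gate(hidden, gate_logits_by_language, language):
    """Scale each hidden unit by sigmoid(logit) chosen for the given language.

    This is a simplified, hypothetical gate: real systems would learn the
    logits jointly with the rest of the network.
    """
    logits = gate_logits_by_language[language]
    return [h / (1.0 + math.exp(-z)) for h, z in zip(hidden, logits)]


# Hypothetical toy data: three languages sharing one output label set.
transcripts_by_language = {
    "en": ["hello"],
    "de": ["grüße"],
    "es": ["niño"],
}
universal_charset = build_universal_charset(transcripts_by_language)
char_to_id = {ch: i for i, ch in enumerate(universal_charset)}

# Any transcript from any language can now be encoded with the same labels.
encoded = [char_to_id[ch] for ch in "niño"]

# With zero logits the gate is sigmoid(0) = 0.5 for every hidden unit.
gated = language_gate([2.0, 4.0], {"en": [0.0, 0.0]}, "en")
```

Because the label set is the union over languages, a single output layer can serve every language, while the gate gives the network a way to modulate shared units differently per language.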
Improved training for online end-to-end speech recognition systems
Achieving high accuracy with end-to-end speech recognizers requires careful
parameter initialization prior to training. Otherwise, the networks may fail to
find a good local optimum. This is particularly true for online networks, such
as unidirectional LSTMs. Currently, the best strategy to train such systems is
to bootstrap the training from a tied-triphone system. However, this is time
consuming, and more importantly, is impossible for languages without a
high-quality pronunciation lexicon. In this work, we propose an initialization
strategy that uses teacher-student learning to transfer knowledge from a large,
well-trained, offline end-to-end speech recognition model to an online
end-to-end model, eliminating the need for a lexicon or any other linguistic
resources. We also explore curriculum learning and label smoothing and show how
they can be combined with the proposed teacher-student learning for further
improvements. We evaluate our methods on a Microsoft Cortana personal assistant
task and show that the proposed method results in a 19% relative improvement
in word error rate compared to a randomly-initialized baseline system. Comment: Interspeech 201
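
As a rough illustration of the techniques mentioned in this abstract, the sketch below shows a per-frame teacher-student loss (cross-entropy of the student against the teacher's soft posteriors), label smoothing of a hard target, and how a relative word error rate improvement is computed. The function names and all numeric values are hypothetical; the abstract does not specify the exact loss formulation, so this is one common form rather than the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_logits):
    """Cross-entropy between a target distribution and the student's softmax."""
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(s) for t, s in zip(target_probs, student_probs) if t > 0.0
    )


def teacher_student_loss(teacher_logits, student_logits):
    """Train the student to match the teacher's soft posteriors for one frame."""
    return cross_entropy(softmax(teacher_logits), student_logits)


def smooth_labels(label_index, num_classes, epsilon):
    """Mix a one-hot target with a uniform distribution (label smoothing)."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) * (1.0 if i == label_index else 0.0) + uniform
        for i in range(num_classes)
    ]


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction, e.g. 0.19 for a 19% relative improvement."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical logits for one frame over four output labels.
teacher_logits = [2.0, 0.5, -1.0, 0.0]
matching_student = [2.0, 0.5, -1.0, 0.0]
mismatched_student = [0.0, 2.0, 0.0, -1.0]

loss_match = teacher_student_loss(teacher_logits, matching_student)
loss_mismatch = teacher_student_loss(teacher_logits, mismatched_student)
smoothed = smooth_labels(label_index=0, num_classes=4, epsilon=0.1)
```

The loss is smallest when the student reproduces the teacher's distribution. For the relative figure, hypothetical WERs of 10.0 and 8.1 would correspond to a 19% relative improvement; these numbers are illustrative and do not come from the paper.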
Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding
End-to-end (E2E) spoken language understanding (SLU) systems that generate a
semantic parse from speech have become more promising recently. This approach
uses a single model that utilizes audio and text representations from
pre-trained automatic speech recognition (ASR) models, and outperforms traditional
pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems
still show weakness when text representation quality is low due to ASR
transcription errors. To overcome this issue, we propose a novel E2E SLU system
that enhances robustness to ASR errors by fusing audio and text representations
based on the estimated modality confidence of ASR hypotheses. We introduce two
novel techniques: 1) an effective method to encode the quality of ASR
hypotheses and 2) an effective approach to integrate them into E2E SLU models.
We show accuracy improvements on the STOP dataset and share an analysis that
demonstrates the effectiveness of our approach. Comment: INTERSPEECH 202
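
One simple way to picture confidence-based fusion of the two modalities is a convex combination in which the text representation is trusted in proportion to an estimated ASR confidence. The sketch below assumes a scalar confidence in [0, 1] and equal-length vectors; the function name and the weighting scheme are hypothetical and are not the method described in the paper, which does not detail its fusion mechanism in the abstract.

```python
def fuse_by_confidence(audio_vec, text_vec, confidence):
    """Blend audio and text vectors, weighting text by the ASR confidence.

    confidence = 1.0 relies entirely on the text representation;
    confidence = 0.0 falls back entirely on the audio representation.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if len(audio_vec) != len(text_vec):
        raise ValueError("audio and text vectors must have the same length")
    return [
        confidence * t + (1.0 - confidence) * a
        for a, t in zip(audio_vec, text_vec)
    ]


# Hypothetical two-dimensional representations.
audio = [1.0, 2.0]
text = [3.0, 4.0]

trusted_text = fuse_by_confidence(audio, text, 1.0)
audio_only = fuse_by_confidence(audio, text, 0.0)
balanced = fuse_by_confidence(audio, text, 0.5)
```

When the ASR hypothesis is likely wrong, a low confidence shifts the fused representation toward the audio features, which is the intuition behind making the model robust to transcription errors.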